---
title: Matrix multiplication in parallel streams
description: Consider an algorithm for multiplying rectangular matrices using Java Streams. Let's take the optimized version of the algorithm using three nested loops...
sections: [Multithreading,Nested loops,Comparing algorithms]
tags: [java,streams,arrays,multidimensional arrays,matrices,rows,loops]
canonical_url: /en/2022/02/09/matrix-multiplication-parallel-streams.html
url_translated: /ru/2022/02/08/matrix-multiplication-parallel-streams.html
title_translated: Умножение матриц в параллельных потоках
date: 2022.02.09
lang: en
---

Consider an algorithm for multiplying rectangular matrices using Java Streams. Let's take the
*optimized version* of the algorithm using three nested loops and replace the outer loop with
a stream. Let's compare the operating time of two algorithms.

We process the rows of the resulting matrix in parallel mode, and populate each row layerwise. Due to
the parallel use of cache of the execution environment on multi-core machines, the computation time can
be reduced by more than two times. To check, let's take two rectangular matrices {`L×M`} and {`M×N`}.

*Further optimization: [Winograd — Strassen algorithm]({{ '/en/2022/02/11/winograd-strassen-algorithm.html' | relative_url }}).*

## Parallel stream {#parallel-stream}

We bypass the rows of the first matrix `L` in parallel mode. In each thread there are two nested loops:
across the *common side* of two matrices `M` and across the columns of the second matrix `N`. Processing
of the rows of the resulting matrix occurs independently of each other in parallel streams.

```java
/**
 * @param l rows of matrix 'a'
 * @param m columns of matrix 'a'
 *          and rows of matrix 'b'
 * @param n columns of matrix 'b'
 * @param a first matrix 'l×m'
 * @param b second matrix 'm×n'
 * @return resulting matrix 'l×n'
 */
public static int[][] parallelMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
    // resulting matrix
    int[][] c = new int[l][n];
    // bypass the indexes of the rows of matrix 'a' in parallel mode
    IntStream.range(0, l).parallel().forEach(i -> {
        // bypass the indexes of the common side of two matrices:
        // the columns of matrix 'a' and the rows of matrix 'b'
        for (int k = 0; k < m; k++)
            // bypass the indexes of the columns of matrix 'b'
            for (int j = 0; j < n; j++)
                // the sum of the products of the elements of the i-th
                // row of matrix 'a' and the j-th column of matrix 'b'
                c[i][j] += a[i][k] * b[k][j];
    });
    return c;
}
```

## Three nested loops {#three-nested-loops}

We take the *optimized* variant of algorithm, that uses cache of the execution environment better
than others. The outer loop bypasses the rows of the first matrix `L`, then there is a loop across
the *common side* of the two matrices `M` and it is followed by a loop across the columns of the
second matrix `N`. Writing to the resulting matrix occurs row-wise, and each row is filled in layers.

```java
/**
 * @param l rows of matrix 'a'
 * @param m columns of matrix 'a'
 *          and rows of matrix 'b'
 * @param n columns of matrix 'b'
 * @param a first matrix 'l×m'
 * @param b second matrix 'm×n'
 * @return resulting matrix 'l×n'
 */
public static int[][] sequentialMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
    // resulting matrix
    int[][] c = new int[l][n];
    // bypass the indexes of the rows of matrix 'a'
    for (int i = 0; i < l; i++)
        // bypass the indexes of the common side of two matrices:
        // the columns of matrix 'a' and the rows of matrix 'b'
        for (int k = 0; k < m; k++)
            // bypass the indexes of the columns of matrix 'b'
            for (int j = 0; j < n; j++)
                // the sum of the products of the elements of the i-th
                // row of matrix 'a' and the j-th column of matrix 'b'
                c[i][j] += a[i][k] * b[k][j];
    return c;
}
```

## Testing {#testing}

To check, we take two matrices `A=[900×1000]` and `B=[1000×750]`, filled with random numbers.
First, we compare the correctness of the implementation of the two algorithms — matrix products
must match. Then we execute each method 10 times and calculate the average execution time.

```java
// start the program and output the result
public static void main(String[] args) {
    // incoming data
    int l = 900, m = 1000, n = 750, steps = 10;
    int[][] a = randomMatrix(l, m), b = randomMatrix(m, n);
    // matrix products
    int[][] c1 = parallelMatrixMultiplication(l, m, n, a, b);
    int[][] c2 = sequentialMatrixMultiplication(l, m, n, a, b);
    // check the correctness of the results
    System.out.println("The results match: " + Arrays.deepEquals(c1, c2));
    // measure the execution time of two methods
    benchmark("Stream", steps, () -> {
        int[][] c = parallelMatrixMultiplication(l, m, n, a, b);
        if (!Arrays.deepEquals(c, c1)) System.out.print("error");
    });
    benchmark("Loops", steps, () -> {
        int[][] c = sequentialMatrixMultiplication(l, m, n, a, b);
        if (!Arrays.deepEquals(c, c2)) System.out.print("error");
    });
}
```

{% capture collapsed_md %}
```java
// helper method, returns a matrix of the specified size
private static int[][] randomMatrix(int row, int col) {
    int[][] matrix = new int[row][col];
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            matrix[i][j] = (int) (Math.random() * row * col);
    return matrix;
}
```
```java
// helper method for measuring the execution time of the passed code
private static void benchmark(String title, int steps, Runnable runnable) {
    long time, avg = 0;
    System.out.print(title);
    for (int i = 0; i < steps; i++) {
        time = System.currentTimeMillis();
        runnable.run();
        time = System.currentTimeMillis() - time;
        // execution time of one step
        System.out.print(" | " + time);
        avg += time;
    }
    // average execution time
    System.out.println(" || " + (avg / steps));
}
```
{% endcapture %}
{%- include collapsed_block.html summary="Helper methods" content=collapsed_md -%}

Output depends on the execution environment, time in milliseconds:

```
The results match: true
Stream | 113 | 144 | 117 | 114 | 114 | 117 | 120 | 125 | 111 | 113 || 118
Loops | 1357 | 530 | 551 | 569 | 535 | 538 | 525 | 517 | 518 | 514 || 615
```

## Comparing algorithms {#comparing-algorithms}

On an eight-core Linux x64 computer, we create a Windows x64 virtual machine for tests. All
other things being equal, in the settings, we change the number of processors. We run the
above test 100 times instead of 10 — we get a summary table of results. Time in milliseconds.

```
   CPU |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |
-------|-----|-----|-----|-----|-----|-----|-----|-----|
Stream | 522 | 276 | 179 | 154 | 128 | 112 | 110 |  93 |
Loops  | 519 | 603 | 583 | 571 | 545 | 571 | 559 | 467 |
```

Results: with a large number of processors in the system, the multithreaded algorithm works
out noticeably faster than three nested loops. The computation time can be reduced by more
than two times.

All the methods described above, including collapsed blocks, can be placed in one class.

{% capture collapsed_md %}
```java
import java.util.Arrays;
import java.util.stream.IntStream;
```
{% endcapture %}
{%- include collapsed_block.html summary="Required imports" content=collapsed_md -%}
